
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>

<script type="text/javascript" charset="utf-8" src="https://ajax.googleapis.com/ajax/libs/jquery/1.3.2/jquery.min.js"></script> 
<!---
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
--->


<style type="text/css">
body {
    font-family: "Titillium Web", "HelveticaNeue-Light", "Helvetica Neue Light", "Helvetica Neue", Helvetica, Arial, "Lucida Grande", sans-serif;
    font-weight: 300;
    font-size: 20px;
    margin-left: auto;
    margin-right: auto;
}

@media screen and (min-width: 980px){
    body {
        width: 980px;
    }
}


h1 {
    font-weight:300;
    line-height: 1.15em;
}

h2 {
    font-size: 1.75em;
}
a:link,a:visited {
    color: #5364cc;
    text-decoration: none;
}
a:hover {
    color: #208799;
}
h1 {
    text-align: center;
}
h2,h3 {
    text-align: left;
}

h1 {
    font-size: 40px;
    font-weight: 500;
}
h2 {
    font-weight: 400;
    margin: 16px 0px 4px 0px;
}
h3 {
    font-weight: 600;
    margin: 16px 0px 4px 0px;
}

.paper-title {
    padding: 1px 0px 1px 0px;
}
section {
    margin: 32px 0px 32px 0px;
    text-align: justify;
    clear: both;
}
.col-5 {
     width: 20%;
     float: left;
}

.move-down {
    margin-top:1.2cm;
}

.col-4 {
     width: 25%;
     float: left;
}
.col-3 {
     width: 33%;
     float: left;
}
.col-2 {
     width: 50%;
     float: left;
}
.col-1 {
     width: 100%;
     float: left;
}

.author-row, .affil-row {
    font-size: 17px;
}

.author-row-new { 
    text-align: center; 
}

.author-row-new a {
    display: inline-block;
    font-size: 17px;
    padding: 4px;
}

.author-row-new sup {
    color: #313436;
    font-size: 13px;
    padding: 4px;
}

.affiliations-new {
    font-size: 16px;
    text-align: center;
    width: 80%;
    margin: 0 auto;
    margin-bottom: 20px;
}

.row {
    margin: 16px 0px 16px 0px;
}
.authors {
    font-size: 26px;
}
.affiliations {
    font-size: 18px;
}
.affil-row {
    margin-top: 18px;
}
.teaser {
    max-width: 100%;
}
.text-center {
    text-align: center;  
}
.screenshot {
    width: 256px;
    border: 1px solid #ddd;
}
.screenshot-el {
    margin-bottom: 16px;
}
hr {
    height: 1px;
    border: 0; 
    border-top: 1px solid #ddd;
    margin: 0;
}
.material-icons {
    vertical-align: -6px;
}
p {
    line-height: 1.25em;
}
.caption {
    font-size: 16px;
    color: #666;
    margin-top: 4px;
    margin-bottom: 10px;
	text-align: left;
}


video {
    display: block;
    margin: auto;
}


figure {
    display: block;
    margin: auto;
    margin-top: 10px;
    margin-bottom: 10px;
}
#bibtex pre {
    font-size: 14px;
    background-color: #eee;
    padding: 16px;
}
.blue {
    color: #2c82c9;
    font-weight: bold;
}
.orange {
    color: #d35400;
    font-weight: bold;
}
.flex-row {
    display: flex;
    flex-flow: row wrap;
    padding: 0;
    margin: 0;
    list-style: none;
}
.flex-row-center {
    display: flex;
    flex-flow: row wrap;
    padding: 0;
    margin: 0;
    list-style: none;
    justify-content: center;
    text-align: center;
}
.flex-container {
  display: flex;
  flex-wrap: wrap;
}

.flex-item {
  flex: 0 0 50%;
  padding: 10px;
  box-sizing: border-box;
}

.paper-btn-coming-soon {
    position: relative; 
    top: 0;
    left: 0;
}

.coming-soon {
    position: absolute;
    top: -15px;
    right: -15px;
}

.center {
  margin-left: 10.0%;
  margin-right: 10.0%;
}

.paper-btn-small {
  position: relative;
  text-align: center;
  vertical-align: middle;

  display: inline-block;
  margin: 8px;
  padding: 8px 8px;

  border-width: 0;
  outline: none;
  border-radius: 2px;
  
  background-color: #E0F7FA;
  color: #01579B !important;
  font-size: 20px;
  width: 100px;
  font-weight: 600;
}



.paper-btn {
  position: relative;
  text-align: center;
  vertical-align: middle;

  display: inline-block;
  margin: 8px;
  padding: 8px 8px;

  border-width: 0;
  outline: none;
  border-radius: 2px;
  
  background-color: #E0F7FA;
  color: #01579B !important;
  font-size: 20px;
  width: 250px;
  font-weight: 600;
}

.paper-btn-tapestry {
  position: relative;
  text-align: center;

  display: inline-block;
  margin: 8px;
  padding: 8px 8px;

  border-width: 0;
  outline: none;
  border-radius: 2px;
  
  background-color: #5364cc;
  color: white !important;
  font-size: 20px;
  width: 200px;
  font-weight: 600;
}

.paper-btn-parent {
    display: flex;
    justify-content: center;
    margin: 16px 0px;
}

.paper-btn:hover {
    opacity: 0.85;
}

.container {
    margin-left: auto;
    margin-right: auto;
    padding-left: 16px;
    padding-right: 16px;
}

.venue {
    font-size: 23px;
}

.topnav {
    background-color: #EEEEEE;
    overflow: hidden;
}

.topnav div {
    max-width: 1070px;
    margin: 0 auto;
}

.topnav a {
    display: inline-block;
    color: black;
    text-align: center;
    vertical-align: middle;
    padding: 16px 16px;
    text-decoration: none;
    font-size: 18px;
}

.topnav img {
    padding: 2px 0px;
    width: 100%;
    margin: 0.2em 0px 0.3em 0px;
    vertical-align: middle;
}

pre {
    font-size: 0.9em;
    padding-left: 7px;
    padding-right: 7px;
    padding-top: 3px;
    padding-bottom: 3px;
    border-radius: 3px;
    background-color: rgb(235, 235, 235);
    overflow-x: auto;
}

.download-thumb {
    display: flex;
}

@media only screen and (max-width: 620px) {
    .download-thumb {
        display: none;
    }
}

.paper-stuff {
    width: 50%;
    font-size: 20px;
}

@media only screen and (max-width: 620px) {
    .paper-stuff {
        width: 100%;
    }
}
* {
  box-sizing: border-box;
}

.column {
  text-align: center;
  float: left;
  width: 16.666%;
  padding: 5px;
}
.column3 {
  text-align: center;
  float: left;
  width: 33.333%;
  padding: 5px;
}
.column4 {
  text-align: center;
  float: left;
  width: 50%;
  padding: 5px;
}
.column5 {
  text-align: center;
  float: left;
  width: 20%;
  padding: 5px;
}
.column10 {
  text-align: center;
  float: left;
  width: 10%;
  padding: 5px;
}
.border-right {
    border-right: 1px solid black;
}
.border-bottom{
    border-bottom: 1px solid black;
}


.row-center {
    margin: 16px 0px 16px 0px;
    text-align: center;
}

/* Clearfix (clear floats) */
.row::after {
  content: "";
  clear: both;
  display: table;
}
.img-fluid {
  max-width: 100%;
  height: auto;
}
.figure-img {
  margin-bottom: 0.5rem;
  line-height: 1;
}

.rounded-circle {
  border-radius: 50% !important;
}

/* Responsive layout - makes the three columns stack on top of each other instead of next to each other */
@media screen and (max-width: 500px) {
  .column {
    width: 100%;
  }
}
@media screen and (max-width: 500px) {
  .column3 {
    width: 100%;
  }
}

.left-column {
    float: left;
    width: 5%;
    text-align: center;
    vertical-align: center;
}

.right-column {
    float: right;
    width: 95%;
    text-align: center;
    vertical-align: center;
}

</style>

<script type="text/javascript"></script>
    <link href='https://fonts.googleapis.com/css?family=Titillium+Web:400,600,400italic,600italic,300,300italic' rel='stylesheet' type='text/css'>
    <head>
        <title> AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data </title>
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <meta property="og:description" content="A series of Code LLMs developed by instruction tuning on harmonized multi-source data."/>
        <link href="https://fonts.googleapis.com/css2?family=Material+Icons" rel="stylesheet">
        <link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined:opsz,wght,FILL,GRAD@20..48,100..700,0..1,-50..200" />
        <!-- <link rel="icon" href="https://images.emojiterra.com/google/noto-emoji/unicode-15.0/color/512px/1f9e0.png"> -->
    </head>

 <body>

<div class="container">
    <div class="paper-title">
    <h1>
        AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data
    </h1>
    </div>

    <div id="authors">
        <center>
            <div class="author-row-new">
                Zifan Song<sup>1,2*</sup>,
                Yudong Wang<sup>2*</sup>,
                <a href="https://zhangwenwei.cn/">Wenwei Zhang<sup>2*</sup></a>,
                Kuikun Liu<sup>2</sup>,
                Chengqi Lyu<sup>2</sup>,
                Demin Song<sup>2</sup>,
                Qipeng Guo<sup>2</sup>,
                Hang Yan<sup>2</sup>,
                <a href="http://dahua.site/">Dahua Lin<sup>2,3</sup></a>,
                <a href="https://chenkai.site/">Kai Chen<sup>2†</sup></a>,
                Cairong Zhao<sup>1†</sup>
            </div>
        </center>
        <center>
            <div class="affiliations">
                <sup>*</sup> Equal Contribution, <sup>†</sup> Corresponding Author 
            </div>
    
        </center>
        <center>
        <div class="affiliations">
            <span><sup>1</sup> Tongji University</span>
            <span><sup>2</sup> Shanghai AI Laboratory</span>
            <span><sup>3</sup> Chinese University of Hong Kong</span>
        </div>

        </center>

        <div>
            <div class="paper-btn-parent">
            <a class="paper-btn" href="https://arxiv.org/abs/2405.19265">
                <span class="material-icons"> description </span> 
                Paper
            </a>
            <a class="paper-btn" href="https://github.com/InternLM/AlchemistCoder">
                <span class="material-icons"> code </span>
                Code
            </a>
            <a class="paper-btn" href="https://huggingface.co/internlm/AlchemistCoder-DS-6.7B">
                <span class="material-icons"> sort </span>
                Models
            </a>
            </div>
        </div>
    </div>
    <section id="abstract">
        <h2 style="text-align: center;">Abstract</h2>
        <hr>

        <div class="flex-row" style="width: 75%; margin: 0 auto;">
            <p>
                Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on a single dataset, which may insufficiently elicit the potential of pre-trained Code LLMs. This paper presents AlchemistCoder, a series of Code LLMs with better code generation and generalization abilities fine-tuned on multi-source data. To harmonize the inherent conflicts among the various styles and qualities in multi-source data, we introduce data-specific prompts, termed AlchemistPrompts, inspired by hindsight relabeling, to improve the consistency between instructions and responses. We further propose to incorporate the data evolution process itself into the fine-tuning data to enhance the code comprehension capabilities of LLMs, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.
            </p>
        </div>
    </section>
    
    <section id="teaser-image">
        <center>
            <figure>
                <a>
                    <img width="65%" src="figure/performance_scatter plot.png"> 
                </a>
                <p class="caption" style="text-align: center;">
                    Figure 1. Performance scatter plot (<b>top right</b> is better) of open-source models on mainstream code benchmarks, HumanEval and MBPP.
                </p>
            </figure>
        </center>
    </section>

    <section id="introduction">
        <h2 style="text-align: center;">Introduction</h2>
        <hr>

            <p style="text-align: center;">
                "<b><i>Alchemist: Someone Who Transforms Things for the Better.</i></b>" (Merriam-Webster)
            </p>

            <p>
                The training of Code LLMs mainly goes through pre-training and fine-tuning stages. Pioneering works have amassed extensive code data for pre-training, while recent open-source models highlight the effectiveness of high-quality or targeted code fine-tuning datasets. Despite these advancements, current fine-tuning methods mainly rely on a particular kind of code-related <B>question-answering</B> dataset, unlike the pre-training stage, which integrates code-related corpora from various sources. Such a discrepancy indicates that the fine-tuning data <B>may not be diverse enough</B> to fully stimulate the capabilities of base models, resulting in limited performance, generalization, and robustness.
            </p>
                
            <p>
                To tackle these challenges, we first explore integrating data from multiple sources and find that directly mixing them (e.g., the DirectlyMix-L-7B model in Fig. 1) does not produce the desired effect because of inherent conflicts within multi-source data. Therefore, we propose to adopt <B>hindsight relabeling</B> for multi-source data mixing, which designs <B>data-specific prompts</B> to <B>harmonize the inherent conflicts of different data sources</B> so that they can be used together to elicit the performance of base models more fully. We term this form of prompt <B><I>AlchemistPrompts</I></B>, inspired by the power and definition of <I>Alchemists</I>. Apart from the conventional problem-solution data, we argue that the evolution of code data reflects higher-level capabilities and is also valuable for the learning of Code LLMs. Thus, we decompose the process of <B>data evolution</B> into three tasks that are incorporated into training, namely <B>instruction evolution</B>, <B>data filtering</B>, and <B>code review</B>, enabling further improvements in code comprehension capabilities.
            </p>
            <p>
                We conduct extensive experiments with various base models and develop the instruction-tuned <B><I>AlchemistCoder</I></B> series. As shown in Fig. 1, on two mainstream code benchmarks, HumanEval and MBPP, <B><I>AlchemistCoder</I></B> <B>holds a clear lead among all models of the same size (6.7B/7B), and rivals or even surpasses larger models (15B/33B/70B)</B>, demonstrating harmonized and formidable code capabilities. More surprisingly, <B><I>AlchemistPrompts</I> allow the code corpus to also significantly improve the general capability of Code LLMs</B>, as demonstrated by the improvements on MMLU, BBH, and GSM8K.
            </p>
        </div>
    </section>
    
    <section id="AlchemistCoder">

    <h2 style="text-align: center;">AlchemistCoder</h2>
    <hr>

    <section id="teaser-image">
        <center>
            <figure>
                <a>
                    <img width="95%" src="figure/overview.png"> 
                </a>
                <p class="caption" style="text-align: center;">
                    Figure 2. Overview of the development of the <i>AlchemistCoder</i> series. 
                </p>
            </figure>
        </center>
    </section>
     
    <div class="flex-row">
        <p>
            <B>Multi-source Data Construction:</B> To fully harness the capabilities of Code LLMs, we gather fine-tuning data from multiple sources and refine the complexity of instructions through instruction evolution. Yet integrating data from diverse sources for instruction tuning presents challenges. Different developers and LLMs offer varied solutions to similar coding questions, leading to diverse response styles and languages. Simply combining data from these sources causes models to learn disparate responses, which hinders alignment and performance. Therefore, directly mixing multi-source data is not a promising solution and can be detrimental.
        </p>
        <p>
            <B>AlchemistPrompt:</B> To enhance model learning from diverse data, we introduce <I>AlchemistPrompts</I>, tailored meta-prompts to reconcile data conflicts. Inspired by hindsight relabeling, we employ GPT-4 as an <I>Alchemist</I> to generate these prompts, adjusting instructions to match data specifics. For example, if a task involves Python code with a Bellman-Ford algorithm, the prompt might request Python code utilizing dynamic programming. <I>AlchemistPrompt</I> adjustments are minimal yet effective, with optimal performance achieved by incorporating them into just 5% of samples. This approach balances diversity and domain gap, elevating data quality. By retrospectively analyzing responses and reinterpreting them as alternative goals, <I>AlchemistPrompts</I> refine model comprehension and instruction-following capabilities, fostering a more nuanced learning process.
        </p>
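        <p>
            As a rough illustration of the 5% mixing step described above, the following Python sketch appends a data-specific prompt to a small random share of instructions. It is not the released pipeline: the function names, the sample fields (<code>language</code>, <code>concept</code>), the choice to append rather than prepend, and the template-based <code>describe_response</code> (a stand-in for the GPT-4 call) are all assumptions made for this example.
        </p>
        <pre style="width: 100%;"><code class="language-python">import random
from typing import Callable, Dict, List

Sample = Dict[str, str]


def describe_response(sample: Sample) -> str:
    """Hypothetical stand-in for the GPT-4 call: state what the response uses."""
    return f"Please answer in {sample['language']} and use {sample['concept']}."


def add_alchemist_prompts(
    samples: List[Sample],
    ratio: float = 0.05,
    make_prompt: Callable[[Sample], str] = describe_response,
    seed: int = 0,
) -> List[Sample]:
    """Return copies of `samples`; a `ratio` share gets a data-specific prompt."""
    rng = random.Random(seed)
    n_pick = round(len(samples) * ratio)
    picked = set(rng.sample(range(len(samples)), n_pick))
    out = []
    for i, sample in enumerate(samples):
        sample = dict(sample)  # copy so the input list is left untouched
        if i in picked:
            # Assumption: the prompt is appended to the original instruction.
            sample["instruction"] = sample["instruction"] + " " + make_prompt(sample)
        out.append(sample)
    return out


data = [
    {
        "instruction": f"Solve task {i}.",
        "response": "# solution code",
        "language": "Python",
        "concept": "dynamic programming",
    }
    for i in range(100)
]
mixed = add_alchemist_prompts(data)
changed = sum(a["instruction"] != b["instruction"] for a, b in zip(data, mixed))
print(changed)  # 5 of the 100 instructions carry a prompt
</code></pre>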
        <p>
            <B>Code Comprehension Task:</B> Existing training datasets for Code LLMs primarily center on code generation tasks, providing programming problems and solutions. However, we advocate for expanding beyond this, recognizing the value in the higher-level abilities demonstrated during code data construction. Thus, to enhance Code LLM performance, we introduce three code comprehension tasks related to data construction: instruction evolution, data filtering, and code review.
        </p>
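        <p>
            One way to picture how the data construction process becomes training data is the Python sketch below, which turns a single record into one instruction-response pair per comprehension task. The record fields, task labels, and prompt wording are hypothetical and chosen only for illustration; the paper's actual formats may differ.
        </p>
        <pre style="width: 100%;"><code class="language-python">from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvolutionRecord:
    """Hypothetical record of one sample's path through data construction."""
    original_instruction: str
    evolved_instruction: str
    solution: str
    kept: bool    # outcome of data filtering
    review: str   # review comment written for the solution


def to_comprehension_samples(rec: EvolutionRecord) -> List[Dict[str, str]]:
    """Build one instruction-response pair for each of the three tasks."""
    return [
        {
            "task": "instruction_evolution",
            "instruction": "Rewrite this programming task to be more complex:\n"
            + rec.original_instruction,
            "response": rec.evolved_instruction,
        },
        {
            "task": "data_filtering",
            "instruction": "Is this a high-quality training sample? Answer yes or no.\n"
            + rec.evolved_instruction + "\n" + rec.solution,
            "response": "yes" if rec.kept else "no",
        },
        {
            "task": "code_review",
            "instruction": "Review this solution:\n" + rec.solution,
            "response": rec.review,
        },
    ]


record = EvolutionRecord(
    original_instruction="Sum a list of integers.",
    evolved_instruction="Sum a list of integers, skipping negative values.",
    solution="def total(xs):\n    return sum(x for x in xs if x >= 0)",
    kept=True,
    review="Correct and concise; consider adding type hints.",
)
pairs = to_comprehension_samples(record)
print([p["task"] for p in pairs])
</code></pre>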
    </section>

    <section id="results">
    <h2 style="text-align: center;">Results</h2>
    <hr>

        <p>
            We adopt 9 benchmarks to evaluate our <I>AlchemistCoder</I> series models, including 6 code benchmarks (<B>HumanEval</B>, <B>HumanEval+</B>, <B>MBPP</B>, <B>MBPP+</B>, <B>HumanEval-X</B>, and <B>DS-1000</B>) and 3 mainstream benchmarks (<B>MMLU</B> for multitask language understanding, <B>BBH</B> for comprehensive reasoning, and <B>GSM8K</B> for mathematical ability).
        </p>
        <br>    
        <center>
            <figure>
                <a>
                    <img width="90%" src="figure/python_code_generation_results.png"> 
                </a>
                <p class="caption" style="text-align: center;">
                    Table 1. Performance of <i>AlchemistCoder</i> series on Python code generation benchmarks (HumanEval/HumanEval+ and MBPP/MBPP+). 
                </p>
            </figure>
        </center>
        <br>
        <center>
            <figure>
                <a>
                    <img width="55%" src="figure/generic_results.png"> 
                </a>
                <p class="caption" style="text-align: center;">
                    Table 2. Performance of <i>AlchemistCoder</i> series on mainstream benchmarks for generic capabilities. 
                </p>
            </figure>
        </center>
    </section>   


    <section id="case study">
    <h2 style="text-align: center;">Case Study</h2>
    <hr>

        <p>
            The efficacy of <I>AlchemistPrompts</I> is twofold: <B>1) Harmonization between different data sources:</B> <I>AlchemistPrompts</I> generated from the same LLM have similar styles and can bridge the style differences between sources, while the introduction of <I>AlchemistPrompt</I>-customized data, accounting for only 5%, achieves a balance between data diversity and domain gaps; <B>2) Harmonization within instruction-response pairs:</B> As fine-grained and data-specific prompts, <I>AlchemistPrompts</I> are designed to augment instructions with specific programming languages, algorithm concepts, and other code-related information involved in responses, which can refine the alignment within instruction-response pairs and enhance the instruction-following abilities of fine-tuned models.
        </p>
        <br>    
        <center>
            <figure>
                <a>
                    <img width="55%" src="figure/alchemistprompt_case_1.png"> 
                </a>
                <p class="caption" style="text-align: center;">
                    Figure 3. Example #1 of <i>AlchemistPrompts</i>. 
                </p>
            </figure>
        </center>
        <br>
        <center>
            <figure>
                <a>
                    <img width="55%" src="figure/alchemistprompt_case_2.png"> 
                </a>
                <p class="caption" style="text-align: center;">
                    Figure 4. Example #2 of <i>AlchemistPrompts</i>. 
                </p>
            </figure>
        </center>
        <br>

    </section>   



    <section>
        <hr>
        This webpage template was recycled from <a href='https://nv-tlabs.github.io/LION/'>here</a>.
        <!-- <center><p><a href='https://accessibility.mit.edu/'><b>Accessibility</b></a></p></center> -->
    </section>

</div>
</body>
</html>
